[Optimizer] DQ + MatMul to MatMulNBits support #21180
Conversation
/azp run Windows_CI / Windows-CUDA-12,Windows_SCA / Onnxruntime-SCA-training-CUDA
No pipelines are associated with this pull request.
OrtThreadPoolParams to;
auto tp = concurrency::CreateThreadPool(&onnxruntime::Env::Default(), to,
Not sure if we should create a new threadpool here. If we really need this to run with a threadpool, consider adjusting the APIs to pass in an existing threadpool.
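
For illustration, a minimal self-contained sketch of that suggestion: the caller injects a pool it already owns rather than the action creating one per invocation. `WorkQueue` and `PackWeights` are stand-ins, not ORT types; the real change would thread ORT's `concurrency::ThreadPool*` through the existing APIs.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Stand-in for a thread pool that the session/environment already owns.
class WorkQueue {
 public:
  void ParallelFor(std::size_t n, const std::function<void(std::size_t)>& fn) {
    std::vector<std::thread> workers;
    workers.reserve(n);
    for (std::size_t i = 0; i < n; ++i) workers.emplace_back(fn, i);
    for (auto& w : workers) w.join();
  }
};

// The action takes a nullable pool instead of creating its own; nullptr falls
// back to sequential work, mirroring ORT's convention for ThreadPool* params.
void PackWeights(std::size_t num_blocks, WorkQueue* pool) {
  auto work = [](std::size_t /*block*/) { /* quantize/pack one weight block */ };
  if (pool == nullptr) {
    for (std::size_t b = 0; b < num_blocks; ++b) work(b);
  } else {
    pool->ParallelFor(num_blocks, work);
  }
}

int main() {
  WorkQueue pool;           // owned by the caller, reused across actions
  PackWeights(8, &pool);    // parallel path
  PackWeights(8, nullptr);  // sequential fallback
}
```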
Will fix in the next iteration.
// used together with DQMatMulNodeGroupSelector, which does the sanity check
struct DQMatMulReplaceWithMatMulNBits : public Action {
  explicit DQMatMulReplaceWithMatMulNBits(int64_t accuracy_level);
  Status Run(Graph&, const NodesToOptimize& selected_nodes) const override;
ReplaceWithNew would definitely be preferred. Not sure why we seem to have a few things using Action directly (SplitReplaceWithQuant and MatMulReplaceWithQLinear), but the intention of the setup was to use something like ReplaceWithNew whenever possible so we're not duplicating logic (and unnecessarily increasing binary size) for the common functionality.
The CI error in ort-web is in this test: … Fails with: … Unclear how we get there.
This is a partial change from [fajin/qdqmatmulnbitstoolchain](#21180). The original PR is blocked by Web CI failures.
### Description
MatMulNBits is a heavily optimized matmul operation. Currently a MatMul can be converted to MatMulNBits to speed up model inference. However, MatMulNBits is an ORT-only op. To keep the graph compatible with standard ONNX ops while still utilizing MatMulNBits, we introduce Q/DQ support for MatMulNBits.
To convert MatMul ops in a model to MatMulNBits:
1. Use matmul_4bits_quantizer.py to convert MatMul to DQ + MatMul using QDQ mode.
2. In an ORT session, DQ + MatMul is fused to MatMulNBits (see the sketch below).
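
For step 2, a minimal sketch of the session-side setup using the public C++ API. The model path is illustrative, and it assumes the extended graph optimization level is sufficient to enable the ORT-internal QDQ fusions:

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "dq_matmul_fusion");
  Ort::SessionOptions so;
  // Extended level enables ORT-internal (non-standard-ONNX) fusions, which is
  // where the QDQ transformers, including DQ + MatMul -> MatMulNBits, run.
  so.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_EXTENDED);
  // Load the QDQ-format model produced by matmul_4bits_quantizer.py; the
  // fusion happens during session initialization.
  Ort::Session session(env, ORT_TSTR("model_qdq.onnx"), so);
  return 0;
}
```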
#### Note
MatMulNBits assumes the B weight is uint4. When no zero point (zp) is provided, zp defaults to 8, which differs from DQ: DQ defaults zp to 0 when none is provided, and DQ also supports int4. Therefore some conversions are introduced during the DQ + MatMul --> MatMulNBits step.
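
To make that conversion concrete, a small self-contained sketch of the int4 -> uint4 shift. The values are illustrative; it only demonstrates that adding 8 to both the quantized value and the zero point preserves the dequantized result:

```cpp
#include <cassert>
#include <cstdint>

int main() {
  const float scale = 0.05f;
  const int8_t q_int4 = -3;  // an int4 weight as DQ stores it (range [-8, 7])
  const int8_t zp_int4 = 0;  // DQ's default zero point

  // Shift both the value and the zero point by 8 into the uint4 domain.
  const uint8_t q_uint4 = static_cast<uint8_t>(q_int4 + 8);    // range [0, 15]
  const uint8_t zp_uint4 = static_cast<uint8_t>(zp_int4 + 8);  // 8, the MatMulNBits default

  // (q - zp) * scale is unchanged, so the dequantized weight is identical.
  const float dq_int4 = (q_int4 - zp_int4) * scale;
  const float dq_uint4 = (q_uint4 - zp_uint4) * scale;
  assert(dq_int4 == dq_uint4);
  return 0;
}
```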
#### Perf
Using the QDQ format increases model initialization time and memory consumption. With the current implementation, model init time increased from ~4 s to ~9 s, and memory consumption increased from ~2.8 GB to ~4.8 GB.
The memory increase is due to:
1. In the optimizer, after transposing the B weight, an in-memory tensor proto is created using protobuf's arena.
2. In the finalize step, when saving initializers and prepacking, the ORT arena is used to create buffers for the initializers.
The memory allocated by arenas cannot be fully deallocated.
If ORT arena memory allocation is disabled, the memory consumption of both the QDQ format and the original format is ~2.2 GB (see the sketch below).
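
For reference, a minimal sketch of disabling the CPU arena through the public C++ API; `DisableCpuMemArena` is the documented session option, and the model path is illustrative:

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "no_arena");
  Ort::SessionOptions so;
  so.DisableCpuMemArena();  // initializers get plain allocations instead of
                            // arena chunks, so memory can be fully released
  Ort::Session session(env, ORT_TSTR("model_qdq.onnx"), so);
  return 0;
}
```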
The time increase is mainly due to multiple memory copies, which can be further optimized.
### Motivation and Context
Please see the description for details.